Introduction

The primary goal of our project is to determine if we can closely predict movie ratings before critic ratings are released and the films are given an average user score on IMDb. Being able to determine whether or not a movie will be a quality release without critic ratings could address a number of business problems that surround the financing and marketing strategies for different types of movies. Film studios and movie theatres could use these insights to budget accordingly and properly advertise movies to maximize popularity, ratings, and profits. The dataset we will be using contains 5,043 movies, and 28 variables related to the movie.

The primary research questions we wish to ask revolve around predicting a movie’s IMDb score. One of which is will the number of faces in a movie poster correlate with the movie’s rating? Can the multiple actor variables and the director variable be useful in predicting a movie’s IMDb score? Does a movie’s parental guidance rating (meaning R, PG-13, etc.) play into how well a movie scores? Does a movie’s budget help determine a movie’s critical success? Do certain genres tend to receive higher critic scores?

The secondary questions we are interested in solving revolve around being able to determine a movie’s revenue based on the following criteria: Do different genres of movie play into a film’s revenue total? Do critic scores determine how much a movie grosses? Does the number of faces on a poster play into how much a movie grosses? How much do popular actors and directors determine a movie’s financial success?

Tertiary questions we wish to investigate are general questions about how variables relate to each other. We wish to find out if different nation’s films tend to have different duration times. We would like to see how gross compares to budget and determine what budget expenditure typically leads to the best returns ratio-wise. We also would like to determine what key plot words are used most commonly, and if they lead to commercial and critical success.

Data Source and Collection

Our data was retrieved as a CSV file from Kaggle:

https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset

The data was retrieved by scraping 28 values for 5043 movies from the IMDb website and by using facial recognition software run on 4906 posters. The movies’ release dates span over 100 years and the movies originate from 66 different countries. The variables of the dataset are as follows:

Some of our data will need to be cleansed before moving on to analysis. The variable “director_facebook_likes” is inaccurate in many instances. Foreign films using currencies different from USD may have a misrepresented gross revenue, because the values aren’t equivalent with foreign currencies. This can be addressed by transforming these values by an exchange rate. We would also like to retrieve the release date for each film, since that variable was not included in the original dataset.

The goal of our project is to find connections between variables and the film’s IMDb score. We believe that the key variables are the following: facenumber_in_poster, director_name, actor_1_name, actor_2_name, and actor_3_name. Our hypothesis is that these variables will be crucial in predicting a movie’s overall critic score. We also hypothesize that budget, gross, actors, directors, genre and content rating will play into how much a movie grosses in revenue.

Preperation work

Clear the environment

rm(list = ls())

Load in Packages

# Note: User may need install packages if they have not previously used them
library(ggplot2)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(plotly) 
## 
## Attaching package: 'plotly'
## The following objects are masked from 'package:plyr':
## 
##     arrange, mutate, rename, summarise
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(corrplot)

Read in data

Set Working Directory

setwd("D:/College Work/Junior Year/IST 3420/Project/Files")

Get CSV data loaded in

movies <- read.csv("movie_metadata CSV.csv", header = TRUE)

Convert movies to data frame

movie_df <- as.data.frame(movies)

show structure of movie_df

str(movie_df)
## 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 1 3 3 3 3 3 ...
##  $ director_name            : Factor w/ 2399 levels "","A. Raven Cruz",..: 929 801 2027 380 606 109 2030 1652 1228 554 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2433 2549 1228 801 2440 653 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 305 983 355 1968 528 443 787 223 338 35 ...
##  $ movie_title              : Factor w/ 4917 levels "#Horror ","[Rec] 2 ",..: 398 2731 3279 3707 3332 1961 3289 3459 399 1631 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1395 3134 1771 1 2714 1970 2163 3018 2941 ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 1 651 4745 29 1142 2005 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 4918 2476 2526 2458 4546 2551 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 1 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 1 65 65 65 65 63 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 1 10 10 9 10 9 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...

Data cleansing and manipulation

Most of the variables for the IMDb data set were formatted correctly to begin with, but some cleansing was required.

We created a new variable that calculated the gross profit of a film by subtracting the budget of the film from the gross revenue. This will allow us to perform analyses later on and see how other factors may affect the profit of movies.

create new variable gross_profit by subtracting budget from gross

# Create new vector, positive means budget was lower than gross, negative means that budget exceeds gross
gross_profit <- movie_df$gross - movie_df$budget

# bind the vector profit to the data frame
movie_df <- cbind(movie_df, gross_profit)

The values of the “movie_title” variable in our data set had the character “” appended at the end and white space at the end, so we removed the character and trimmed the white space for better readability and to prevent any issues that might have arisen without eliminating the extra white space.

# Remove the Ă‚ at the end of the titles
movie_df$movie_title <- gsub("Ă‚", "", movie_df$movie_title)

# Trim off the whitespace
movie_df$movie_title <- trimws(movie_df$movie_title)

There were also a number of duplicated values of movies, so we checked for redundant data and removed duplicate entries.

First we need to check to see if duplicated movie_titles values exist.

# Show number of unique titles
length(unique(movie_df$movie_title))
## [1] 4916
# Show number of duplicated titles
anyDuplicated(movie_df$movie_title)
## [1] 138

After verifying these are actually duplicates, We need to remove duplicates to avoid skew in our findings

# This statement removes movie titles that are duplicates
movie_df <- movie_df[!duplicated(movie_df$movie_title),]

Show summary statistics now that the dataset has been cleansed

summary(movie_df)
##               color               director_name  num_critic_for_reviews
##                  :  19                   : 102   Min.   :  1           
##   Black and White: 204   Steven Spielberg:  26   1st Qu.: 49           
##  Color           :4693   Woody Allen     :  22   Median :108           
##                          Clint Eastwood  :  20   Mean   :138           
##                          Martin Scorsese :  20   3rd Qu.:191           
##                          Ridley Scott    :  16   Max.   :813           
##                          (Other)         :4710   NA's   :49            
##     duration     director_facebook_likes actor_3_facebook_likes
##  Min.   :  7.0   Min.   :    0.0         Min.   :    0.0       
##  1st Qu.: 93.0   1st Qu.:    7.0         1st Qu.:  132.0       
##  Median :103.0   Median :   48.0         Median :  366.0       
##  Mean   :107.1   Mean   :  691.0         Mean   :  631.3       
##  3rd Qu.:118.0   3rd Qu.:  189.8         3rd Qu.:  633.0       
##  Max.   :511.0   Max.   :23000.0         Max.   :23000.0       
##  NA's   :15      NA's   :102             NA's   :23            
##           actor_2_name  actor_1_facebook_likes     gross          
##  Morgan Freeman :  18   Min.   :     0         Min.   :      162  
##  Charlize Theron:  14   1st Qu.:   607         1st Qu.:  5019656  
##                 :  13   Median :   982         Median : 25043962  
##  Brad Pitt      :  13   Mean   :  6494         Mean   : 47644515  
##  Meryl Streep   :  11   3rd Qu.: 11000         3rd Qu.: 61108413  
##  Adam Sandler   :  10   Max.   :640000         Max.   :760505847  
##  (Other)        :4837   NA's   :7              NA's   :862        
##                   genres                actor_1_name  movie_title       
##  Drama               : 233   Robert De Niro   :  48   Length:4916       
##  Comedy              : 205   Johnny Depp      :  36   Class :character  
##  Comedy|Drama        : 189   Nicolas Cage     :  32   Mode  :character  
##  Comedy|Drama|Romance: 185   Denzel Washington:  29                     
##  Comedy|Romance      : 157   J.K. Simmons     :  29                     
##  Drama|Romance       : 150   Matt Damon       :  29                     
##  (Other)             :3797   (Other)          :4713                     
##  num_voted_users   cast_total_facebook_likes         actor_3_name 
##  Min.   :      5   Min.   :     0                          :  23  
##  1st Qu.:   8362   1st Qu.:  1395            Steve Coogan  :   8  
##  Median :  33133   Median :  3049            Ben Mendelsohn:   7  
##  Mean   :  82645   Mean   :  9580            Jon Gries     :   7  
##  3rd Qu.:  93773   3rd Qu.: 13617            Robert Duvall :   7  
##  Max.   :1689764   Max.   :656730            Sam Shepard   :   7  
##                                              (Other)       :4857  
##  facenumber_in_poster               plot_keywords 
##  Min.   : 0.000                            : 152  
##  1st Qu.: 0.000       based on novel       :   4  
##  Median : 1.000       one word title       :   3  
##  Mean   : 1.377       after dark horrorfest:   2  
##  3rd Qu.: 2.000       color in title       :   2  
##  Max.   :43.000       two word title       :   2  
##  NA's   :13           (Other)              :4751  
##                                              movie_imdb_link
##  http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1:   1  
##  http://www.imdb.com/title/tt0011549/?ref_=fn_tt_tt_1:   1  
##  http://www.imdb.com/title/tt0015624/?ref_=fn_tt_tt_1:   1  
##  http://www.imdb.com/title/tt0017136/?ref_=fn_tt_tt_1:   1  
##  http://www.imdb.com/title/tt0018737/?ref_=fn_tt_tt_1:   1  
##  http://www.imdb.com/title/tt0019729/?ref_=fn_tt_tt_1:   1  
##  (Other)                                             :4910  
##  num_user_for_reviews     language         country       content_rating
##  Min.   :   1.0       English :4582   USA      :3710   R        :2067  
##  1st Qu.:  64.0       French  :  73   UK       : 434   PG-13    :1411  
##  Median : 153.0       Spanish :  40   France   : 154   PG       : 686  
##  Mean   : 267.7       Hindi   :  28   Canada   : 124            : 300  
##  3rd Qu.: 320.5       Mandarin:  24   Germany  :  94   Not Rated: 115  
##  Max.   :5060.0       German  :  19   Australia:  53   G        : 112  
##  NA's   :21           (Other) : 150   (Other)  : 347   (Other)  : 225  
##      budget            title_year   actor_2_facebook_likes   imdb_score   
##  Min.   :2.180e+02   Min.   :1916   Min.   :     0         Min.   :1.600  
##  1st Qu.:6.000e+06   1st Qu.:1999   1st Qu.:   277         1st Qu.:5.800  
##  Median :1.985e+07   Median :2005   Median :   593         Median :6.600  
##  Mean   :3.655e+07   Mean   :2002   Mean   :  1622         Mean   :6.437  
##  3rd Qu.:4.300e+07   3rd Qu.:2011   3rd Qu.:   912         3rd Qu.:7.200  
##  Max.   :4.200e+09   Max.   :2016   Max.   :137000         Max.   :9.500  
##  NA's   :484         NA's   :106    NA's   :13                            
##   aspect_ratio    movie_facebook_likes  gross_profit       
##  Min.   : 1.180   Min.   :     0       Min.   :-4.200e+09  
##  1st Qu.: 1.850   1st Qu.:     0       1st Qu.:-1.020e+07  
##  Median : 2.350   Median :   159       Median : 7.517e+05  
##  Mean   : 2.222   Mean   :  7348       Mean   : 8.693e+06  
##  3rd Qu.: 2.350   3rd Qu.:  2000       3rd Qu.: 2.428e+07  
##  Max.   :16.000   Max.   :349000       Max.   : 5.235e+08  
##  NA's   :326                           NA's   :1127

Correlation

We chose to use Pearson Correlation, because the measures are not ordinal, and measures are linear

Pearson Correlation Table 1

# Create data frame with only numeric types to see how variables correlate to imdb score 
numeric_df <- select_if(movie_df, is.numeric) # Other option cor(movie_df[sapply(movie_df, is.numeric)])

# Creates a correlation table showing variables correlate w/ each other by number and text color 
num_correlation <- round(cor(numeric_df, use = "pairwise.complete.obs", method = "pearson"),1)
corrplot(num_correlation, method = "number", type = )

# Note: After running this chunk, you can also select hte num_correlation item from the global environment to view a generic table

Pearson Correlation Table 2

# Creates a correlation table showing variables correlate w/ each other by color and  circle size
corrplot(num_correlation, method = "circle")

Linear Regresssion Models based on correlation

Linear Regression Model based on how num_voted_users relates to score

# Create random and representative sample data frame so there is not an overwheliming amount of values 
small_movie_df <- movie_df[sample(1:nrow(movie_df), 500, replace = F), ]

# Linear regression using ggplot2
ggplot(small_movie_df, aes(x = num_voted_users, y = imdb_score)) +
  geom_point(shape = 1) +
  geom_smooth(method = lm) +
  ggtitle("Voted User & Score Linear Regression")+
  ylab("IMDB Score") +
  xlab("Voted Users")

Based on this regression plot, it appears that there is a moderate positive correlation between the amount of users who have voted on a movie and its overall score. There is a cluster where presumable niche or indie movies with a smaller amount of user votes have a range of varied scores, but as the number of user votes increases, the IMDb score approaches 10 (the maximum possible score). This may actually be because better movies would be seen by larger audiences over time and would therefore receive more votes.

Number of movie facebook likes seems closely related with # of critic reviews, let’s take a look

ggplot(small_movie_df, aes(x = movie_facebook_likes, y = num_critic_for_reviews)) +
  geom_point(shape = 1) +
  geom_smooth(method = lm, color = "Green") +
  ggtitle("Facebook likes and # of critc review regression") +
  ylab("# of Critic Reviews") +
  xlab("Move facebook likes")
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
## Warning: Removed 3 rows containing missing values (geom_point).

Similar to the previous regression of user votes and IMDb score, there appears to be a cluster with a large spread initially where the number of movie likes on Facebook isn’t a strong indicator of the number of critic reviews, but a moderate positive correlation overall between the two exists. This may be because a given movie’s overall popularity could attract more of both critics and the general audience alike.

Data Visualization and analysis

We can look to see which years tend to have higher quality movies based on average IMDb scores of movies over time.

# Create data frame with year and average score
yearly_score <- ddply(movie_df, c("title_year"), summarise,
                      mean(imdb_score, na.rm = TRUE))
colnames(yearly_score) <- c("title_year", "avg_score")

# Plotly line graph that shows average imdb scores across the years
plot_ly(yearly_score, x = yearly_score$title_year, y = yearly_score$avg_score, type = "scatter", mode = "lines") %>% 
  layout(title = "Year and Score")

Up until the 1970s, only a few movies may be registered in the IMDb database for a given year, with many years only yielding a single movie. This leads to large variances between years up until the 1970s, such as how a single movie in 1920 is completely representative of the year’s extremely low average movie IMDb score of 4.8, only to surge the exisitng data point because a single movie performed better in 1925 and received a rating of 8.3. Regardless, the general trend appears to be a declining average IMDb score over time. This may not reflect the quality of large production films, but rather reduced barriers to entry have enabled a surge in indie films to enter the market and the fact that the IMDb database can also register these films with the rise of internet-based titles being published on platforms like YouTube.

We can explore which actors have the highest average movie score overall.

# Find actors with highest average movie score 
actor_rating_df <- ddply(movie_df, c("actor_1_name"), summarise, # Pull out actor 1 and rating data
                      actor_m <- mean(imdb_score, na.rm = TRUE),  # Find the mean of actors and discount na values
                      number <- length(na.omit(imdb_score))) # Count number of movies, but omit na values
colnames(actor_rating_df) <- c("actor_name", "avg_score", "num_of_movies") # Rename headers
actor_avg <- actor_rating_df[which(actor_rating_df$num_of_movies >= 7),]# Actors in at least 7 movies to avoid skew
actor_avg <- actor_avg[which(actor_avg$avg_score >= 7.15),] # Show only highest scoring actors for readability

# Convert actor_name to a factor
actor_avg$actor_name <- factor(actor_avg$actor_name)

# Show highest to lowest average score
actor_avg$actor_name <- reorder(actor_avg$actor_name, actor_avg$avg_score)

# Visualization of highest scoring actors
ggplot(actor_avg, aes(x = avg_score, y = actor_name)) + # Sets x to average score and y to the name
  geom_point(aes(colour = actor_name)) + # Makes a point graph with different colors per actor
  theme(axis.text = element_text(size = 10)) + # Sets font size
  ggtitle("Actor Scores") + # Changes chart title
  xlab("Average Score") + # Changes x field title
  ylab("Actor Name")  # Changes y field table

We can see a visual representation of actors with the highest average IMDb ratings for the films they star in, with the top 3 being Leonardo DiCaprio, Tom Hanks, and Clint Eastwood.

We can explore which directors have the highest average movie score overall.

# Find actors with highest average movie score 
dir_rating_df <- ddply(movie_df, c("director_name"), summarise, # Pull out director name and rating data
                      dir_m <- mean(imdb_score, na.rm = TRUE),  # Find the mean of directors and discount na values
                      number <- length(na.omit(imdb_score))) # Count number of movies, but omit na values
colnames(dir_rating_df) <- c("director_name", "avg_score", "num_of_movies") # Rename headers
dir_avg <- dir_rating_df[which(dir_rating_df$num_of_movies >= 7),]# Direct  at least 7 movies to avoid skew
dir_avg <- dir_avg[which(dir_avg$avg_score >= 7.5),] # Show only highest scoring directors for readability
dir_avg <- dir_avg[which(dir_avg$director_name != ''),] # Exclude movies who have no director name 

# Convert actor_name to a factor
dir_avg$director_name <- factor(dir_avg$director_name)

# Show highest to lowest average score
dir_avg$director_name <- reorder(dir_avg$director_name, dir_avg$avg_score)

# Visualization of highest scoring actors
ggplot(dir_avg, aes(x = avg_score, y = director_name)) + # Sets x to average score and y to the name
  geom_point(aes(colour = director_name)) + # Makes a point graph with different colors per director
  theme(axis.text = element_text(size = 10, colour = 'black')) + # Sets font size
  ggtitle("Director Scores") +  # Changes chart title
  xlab("Average Score") + # Changes x field title
  ylab("Director Name") # Changes y field table

We can see a visual representation of directors with the highest average IMDb ratings for the films they star in, with the top 3 being Christopher Nolan, Quentin Tarantino, and James Cameron.

We can explore the average IMDb score for each content rating.

content_df <- ddply(movie_df, c("content_rating"), summarise, # New DF w/ rating, avg score and # of movies 
                    mean(imdb_score, na.rm = TRUE), # Creates column containing mean scores
                    length(na.omit(imdb_score))) # Omits scores with NA values
colnames(content_df) <- c("content_rating", "avg_score", "number_of_films") # Rename columns
content_df <- content_df[which(content_df$content_rating != ""),] # Eliminate films from analysis without rating

# Visualization of average score by content rating
ggplot(content_df, aes(x = content_rating, y = avg_score)) + # Bar graph with x as rating and y as score
  geom_bar(stat = "identity", width = .5, fill = 'deepskyblue3') + # Creates space between bars and changes color
  theme(axis.text = element_text(size = 8, colour = 'black', angle = 90)) + # Sets text size to 8, text color to black, and angles text for separation purposes
  ggtitle("Content Scores") + # Title of graph
  xlab("Content Rating") + # Title of x axis
  ylab("IMDB Score") # Title of y axis

Based on this plot, we can see the highest performing category of movies based on content rating is TV-MA (intended for a mature audience), with the lowest performing category being PG-13.This could possibly be the effect of restricted possibilities of movies that target a PG-13 audience not being as favored by mature audiences who watch both TV-MA movies and PG-13 movies.

We can show the spread of scores per content rating category with box plots to detect if outliers could be affecting the score averages of each content rating category.

# Create data frame that excludes movies with no content rating 
content_df2 <- movie_df[which(movie_df$content_rating != ''), ]

#Use plotly for great boxplot capabilities, note: Might need to install "plotly"
plot_ly(content_df2, x = content_df2$imdb_score, color = content_df2$content_rating, type = "box") %>%
  layout(title = "Movie Type and Score Range")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Based on these box plots, we can see a number of potential outliers in several categories. There also appear to be clusters of values picked up as outliers in the content rating categories of R, PG-13, and PG, which may indicate bimodal distributions in these categories. These outliers will need to be accounted for when considering the content rating’s effect on the average movie score.